ABSTRACT


The need to precisely estimate a student’s academic performance
has been emphasized as Intelligent Tutoring Systems (ITSs)
receive increasing attention. However, since labels for academic
performance, such as test scores, are collected from outside the
ITS, obtaining the labels is costly. This leads to a
label-scarcity problem, which makes it challenging to apply
machine learning approaches to academic performance prediction.
To this end, inspired by recent advances in pre-training methods
in the natural language processing community, we propose DPA, a
transfer learning framework with Discriminative Pre-training
tasks for Academic performance prediction. DPA pre-trains two
models, a generator and a discriminator, and fine-tunes the
discriminator on academic performance prediction. In DPA’s
pre-training phase, a sequence of interactions in which some
tokens are masked is provided to the generator, which is trained
to reconstruct the original sequence. Then, the discriminator
takes an interaction sequence in which the masked tokens are
replaced by the generator’s outputs, and is trained to predict
the originality of every token in the sequence. We conduct
extensive experimental studies on a real-world dataset obtained
from a multi-platform ITS application and show that DPA
outperforms the previous state-of-the-art generative pre-training
method with a reduction of 4.05% in mean absolute error and is
more robust to increased label-scarcity.

1. INTRODUCTION


¹For more detailed descriptions of experimental settings and
results, please refer to the arXiv version of this paper.


Byungsoo Kim, Hangyeol Yu, Dongmin Shin and Youngduck
Choi “Knowledge Transfer by Discriminative Pre-training for
Academic Performance Prediction”. 2021. In: Proceedings of
The 14th International Conference on Educational Data Mining
(EDM21). International Educational Data Mining Society, 287-294.
https://educationaldatamining.org/edm2021/

EDM ’21 June 29 - July 02 2021, Paris, France


[Figure 1 diagram: a student's interactive features (response,
elapsed time) flow into the ITS database; separately, the student
(1) takes a test at the test center and (2) reports the test
score.]

Figure 1: Interactive features, such as student response and
elapsed time for the response, are automatically recorded to
the database whenever a student interacts with the ITS. On the
other hand, more complicated steps are necessary to obtain a
test score: a student should take the test in the designated
test center, receive the test score, and report the score to
the ITS.


Predicting a student’s future academic performance is a
fundamental task in developing a modern Intelligent Tutoring
System (ITS), which aims to provide a personalized learning
experience by supporting the educational needs of each
individual. However, labels for academic performance, such as
test scores, are often scarce since they are external to the ITS.
For example, as shown in Figure 1, test scores are not
automatically collected inside the ITS. Obtaining a test score
requires a student to take the test in the designated test center,
receive the score, and report the score to the ITS. Transfer
learning is
a commonly taken approach to address such label-scarcity 
problems across different domains of machine learning. In 
this framework, a model is first pre-trained to optimize aux- 
iliary objectives with abundant data, and then fine-tuned 
on the task of interest. In Artificial Intelligence in Educa- 
tion (AIEd) community, [3] introduced Assessment Model- 
ing (AM), a set of pre-training tasks for label-scarce educa- 
tional problems including academic performance prediction. 
AM proposed a pre-training method where first, a masked 
interaction sequence is generated by replacing a set of
interactive features which can serve as criteria for pedagogical
evaluation with artificial mask tokens. Then, given the
masked interaction sequence, a model is pre-trained to predict
the masked interactive features. The idea was borrowed
from the Masked Language Modeling (MLM) pre-training 
method proposed in [7]. In the MLM pre-training method, 
given a masked word sequence where some words in the se- 
quence are replaced with an artificial mask token, a model is 
pre-trained to predict the masked words. However, recently, 
[6] pointed out that the MLM pre-training method has poor 
sample efficiency and suffers from pre-train/fine-tune dis- 
crepancy due to the artificial mask token, and proposed a 
new discriminative pre-training method. Since the same
problems are inherent in AM, we expect gains from applying
the discriminative pre-training method to academic
performance prediction.


To this end, we propose DPA, a transfer learning framework 
with Discriminative Pre-training tasks for Academic perfor- 
mance prediction. There are two models in DPA: a generator 
and a discriminator. In DPA’s pre-training phase, the gen- 
erator is trained to predict the masked interactive features 
in the same way as AM. Then, given a replaced interac- 
tion sequence which is generated by replacing the masked 
features with the generator’s outputs, the discriminator is 
trained to predict whether each token in the sequence is the 
same as the one in the original interaction sequence. After 
the pre-training, the generator is thrown away and only the 
discriminator is fine-tuned on academic performance predic- 
tion. Also, we investigate diverse pre-training tasks for the 
generator and show that pre-training the generator to pre- 
dict a student’s response is more effective than to predict 
the correctness and timeliness of their response which were 
considered as the most pedagogical interactive features in 
AM. Extensive experimental studies conducted on a real-world
dataset collected from a multi-platform ITS application show
that DPA outperforms AM with a reduction of 4.05% in Mean
Absolute Error (MAE) and is more robust when the degree of
label-scarcity increases.


2. SANTA: A SELF-STUDY SOLUTION 
EQUIPPED WITH AN AI TUTOR FOR 
ENGLISH EDUCATION 


In this paper, we conduct experiments on a real-world dataset
obtained from Santa², a multi-platform ITS with more than
a million users in South Korea, available through Android,
iOS, and the Web, that exclusively focuses on the Test of English
for International Communication (TOEIC) standardized
examination. The publicly accessible version of the dataset
was released under the name EdNet [4]. The TOEIC consists
of two timed sections, Listening Comprehension (LC)
and Reading Comprehension (RC). There are a total of 100
multiple choice exercises in each section, and the total score
for each section is 495 in steps of 5 points. Santa provides
learning experiences of solving exercises, studying
explanations, and watching lectures. When a student consumes a
specific piece of learning content, Santa diagnoses their current
academic status based on their learning activity records and
recommends other learning content appropriate for their
current position. Santa records diverse types of interactive
features, such as student response, the duration of time the
student took to respond, and the time interval between the
current and previous learning activities. However, unlike
the interactive features automatically collected from Santa,
obtaining the official TOEIC score requires more steps: a
student should register and pay for the test, take the test
in the designated test center, receive the test score from the
Educational Testing Service, and report the score to Santa
(Figure 1). Santa collected students’ TOEIC score data by
offering small gifts to students when they report their scores.


²https://aitutorsanta.com


3. TRANSFER LEARNING FOR ACADEMIC 
TEST PERFORMANCE PREDICTION 


To overcome the label-scarcity problem in academic test
performance prediction, we consider the burgeoning machine
learning discipline of transfer learning. An open issue is
what information to transfer, or which pre-training task is
the most effective for academic test performance prediction.
Previous studies proposed two types of pre-training methods
for AIEd tasks: interaction-based methods, which model
students’ dynamic learning behaviors [13, 15, 8, 3], and
content-based methods, which learn representations of learning
contents [14, 23, 19, 24, 29]. [3] showed that the
interaction-based pre-training method outperforms content-based
pre-training methods when the pre-trained model is fine-tuned
on several label-scarce educational tasks including academic
test performance prediction. Following this line of research,
we propose a transfer learning framework where a model is
pre-trained using only student interaction data and then
fine-tuned on academic test performance prediction. In this
paper, we consider the following interactive features:


• eid: A unique ID assigned to an exercise solved by a
student. There are a total of 14419 exercises in the
dataset.


• part: Each exercise belongs to a specific part that
represents the type of the exercise. There are a total of 7
parts in the TOEIC.


• response: Since the TOEIC consists of multiple choice
exercises and there are four options for each exercise,
a student response for a given exercise is one of the
options, ‘a’, ‘b’, ‘c’, or ‘d’.


• correctness: Whether a student responded correctly
to a given exercise. Note that correctness is a coarse
version of response since correctness is processed by
comparing response with a correct answer for a given
exercise.


• elapsed_time: The amount of time a student spent on
solving a given exercise.


• timeliness: Whether a student responded to a given
exercise under the time limit. Note that timeliness is
a coarse version of elapsed_time since timeliness is
processed by comparing elapsed_time with the time limit
recommended by domain experts for a given exercise.


• exp_time: The amount of time a student spent on
studying an explanation for an exercise they had solved.


• inactive_time: The time interval between the current
and previous interactions.


Figure 2: The overall pre-training/fine-tuning process of DPA
when each token in an interaction sequence is a set of eid, part,
and response, and response is the feature being masked. mask and
cls are special tokens for mask and classification, respectively,
which are the same as the ones used in [7].


In our experiments, we normalize the values of elapsed_time, 
exp_time, and inactive_time so they are between 0 and 1 to 
stabilize the training process. 
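
The relationship between the raw and coarse features, together with the
normalization step, can be sketched as follows. This is a minimal
illustration rather than the authors' preprocessing code: the min-max
scaling scheme, the function names, the inclusive comparison against the
time limit, and the sample values are all assumptions, since the text only
states that the continuous values are scaled to lie between 0 and 1.

```python
def correctness(response, correct_answer):
    """Coarse version of response: 1 if the chosen option is the correct one."""
    return int(response == correct_answer)


def timeliness(elapsed_time, time_limit):
    """Coarse version of elapsed_time: 1 if answered within the time limit.

    Whether the boundary is inclusive is an assumption.
    """
    return int(elapsed_time <= time_limit)


def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumed scheme."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical elapsed times in seconds.
elapsed = [12.0, 30.0, 48.0]
normalized = min_max_normalize(elapsed)
```

The same scaling would be applied separately to exp_time and
inactive_time under this assumption.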


4. PROPOSED METHOD 


Figure 2 depicts our proposed method. There are two models
in DPA: a generator and a discriminator. In the pre-training
phase, given a sequence of interactions $I = [I_1, \dots, I_T]$,
where each interaction $I_t = \{f_t^1, \dots, f_t^k\}$ is a set of
interactive features $f_t^i$, such as eid, part, and response, a
masked interaction sequence $I^M = [I_1^M, \dots, I_T^M]$ is
generated by first randomly selecting a set of positions to mask
$M = \{M_1, \dots, M_m\}$ ($m < T$), and, for each masked position
$M_i$, masking out a fixed set of features
$\{f_{M_i}^{n_1}, \dots, f_{M_i}^{n_n}\}$ ($n < k$).
For instance, in Figure 2, if the original interaction sequence
is [(e419, part4, b), (e23, part3, c), (e4324, part3, a), (e5233,
part1, a)] where each token in the sequence is a set of eid,
part, and response, a masked interaction sequence where
$M = \{2, 3\}$ and response is the masked feature is [(e419,
part4, b), (e23, part3, mask), (e4324, part3, mask), (e5233,
part1, a)]. Then, the generator takes the masked interaction
sequence $I^M$ as an input, and outputs predicted values
$O_{M_i}^{G, n_j}$ for the masked features $f_{M_i}^{n_j}$. After
that, a replaced interaction sequence $I^R = [I_1^R, \dots, I_T^R]$
is generated by replacing the masked features $f_{M_i}^{n_j}$ with
the generator’s predictions $O_{M_i}^{G, n_j}$. In Figure 2, since
the generator’s outputs for the masked features are ‘b’ and ‘a’,
a replaced interaction sequence is [(e419, part4, b), (e23, part3,
b), (e4324, part3, a), (e5233, part1, a)]. Then, the discriminator
takes the replaced interaction sequence $I^R$ as an input, and
predicts whether each token in the sequence is the same as the
one in the original interaction sequence (original) or not
(replaced). After the pre-training, we throw away the generator
and fine-tune the pre-trained discriminator on academic test
performance prediction. We provide detailed explanations of
each component in the generator and the discriminator, and
training objective functions in the following subsections.
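
The Figure 2 example can be traced in a few lines of code. The sketch
below is illustrative only: the helper names are hypothetical, and the
generator is replaced by a fixed lookup holding the two predictions quoted
in the text (‘b’ and ‘a’) rather than a trained model.

```python
MASK = "mask"


def mask_sequence(sequence, positions, feature_index):
    """Replace one feature with the mask token at the given 1-based positions."""
    masked = []
    for t, interaction in enumerate(sequence, start=1):
        features = list(interaction)
        if t in positions:
            features[feature_index] = MASK
        masked.append(tuple(features))
    return masked


def replace_sequence(masked, positions, feature_index, predictions):
    """Fill the masked features with the generator's predictions."""
    replaced = []
    for t, interaction in enumerate(masked, start=1):
        features = list(interaction)
        if t in positions:
            features[feature_index] = predictions[t]
        replaced.append(tuple(features))
    return replaced


def discriminator_labels(original, replaced):
    """Label each token by comparing it with the original token."""
    return ["original" if o == r else "replaced"
            for o, r in zip(original, replaced)]


original = [("e419", "part4", "b"), ("e23", "part3", "c"),
            ("e4324", "part3", "a"), ("e5233", "part1", "a")]
positions = {2, 3}            # M = {2, 3}
RESPONSE = 2                  # index of the response feature in each token

masked = mask_sequence(original, positions, RESPONSE)
generator_outputs = {2: "b", 3: "a"}   # stand-in for a trained generator
replaced = replace_sequence(masked, positions, RESPONSE, generator_outputs)
labels = discriminator_labels(original, replaced)
```

Note that the third token was masked, but the generator happened to
predict the original response, so its target label is still "original".
Only the second token is labeled "replaced".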


4.1 Interaction Embeddings 


The embedding layer produces a sequence of interaction
embedding vectors by mapping each interactive feature to an
appropriate embedding vector. We take two different approaches
to embed the interactive features depending on whether they
are categorical (eid, part, response, correctness, and
timeliness) or continuous (elapsed_time, exp_time, and
inactive_time) variables. If an interactive feature is a
categorical variable, we assign unique latent vectors to the
possible values of the feature, including special values for
mask (mask) and classification (cls). Taking response as an
example, there is an embedding matrix
$E_{response} \in \mathbb{R}^{6 \times d_{emb}}$ where each row
vector is assigned to one of ‘a’, ‘b’, ‘c’, ‘d’, mask, and cls.
If an interactive feature is a continuous variable, we assign
a single latent vector to the feature. Then, an embedding
vector for the feature is computed by multiplying the latent
vector by the value of the feature. For instance, we compute
an embedding vector for elapsed_time as
$et \cdot E_{elapsed\_time}$, where $et$ is a specific value and
$E_{elapsed\_time} \in \mathbb{R}^{d_{emb}}$ is the latent vector
assigned to elapsed_time. Also, mask and classification for the
continuous interactive features are indicated by setting their
values to -1 and 0, respectively. In addition to the embeddings
for interactive features, positional embeddings are incorporated
into Transformer-based models [27] to account for the
chronological order of the tokens. Rather than using
conventional positional embeddings, which store an embedding
vector for every possible position, we adopt axial positional
embeddings [17] to further reduce memory usage. The final
interaction embedding vector of dimension $d_{emb}$ for each
time-step is the sum of all embedding vectors in the time-step.
The interaction embedding layer is shared by both the generator
and the discriminator.
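
The two embedding rules can be illustrated with a small sketch. It is not
the authors' implementation: the embedding size, the random initialization,
the restriction to two features, and the plain vector standing in for an
axial positional embedding are assumptions made for brevity.

```python
import random

D_EMB = 4                     # hypothetical embedding dimension d_emb
rng = random.Random(0)


def random_vector():
    """Stand-in for a learned latent vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(D_EMB)]


# Categorical feature: one latent vector per value, including mask and cls,
# which corresponds to a 6 x d_emb embedding matrix for response.
RESPONSE_VALUES = ["a", "b", "c", "d", "mask", "cls"]
E_response = {value: random_vector() for value in RESPONSE_VALUES}

# Continuous feature: a single latent vector scaled by the feature value.
E_elapsed_time = random_vector()


def embed_response(value):
    return E_response[value]


def embed_elapsed_time(et):
    # et is the normalized value in [0, 1]; -1 marks mask and 0 marks cls.
    return [et * w for w in E_elapsed_time]


def interaction_embedding(response, elapsed_time, positional):
    """Sum all embedding vectors of one time-step, element by element."""
    parts = [embed_response(response),
             embed_elapsed_time(elapsed_time),
             positional]
    return [sum(components) for components in zip(*parts)]


position_vector = random_vector()   # stand-in for an axial positional embedding
vector = interaction_embedding("b", 0.25, position_vector)
```

Because a continuous feature is embedded by scaling one vector, its
classification value of 0 contributes a zero vector to the sum.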


4.2 Performer Encoder 

Figure 3: The reversible layer in the Performer encoder is
composed of the FAVOR+-based multi-head attention layer
and the point-wise feed-forward layer.


Since its successful debut in the Natural Language Processing
(NLP) community, Transformer’s attention mechanism has
become a common recipe adopted across different domains of
machine learning including speech processing [18], computer
vision [1, 9], and AIEd [21, 2, 11, 22]. Compared to Recurrent
Neural Network (RNN) family models, Transformer’s
attention mechanism has the benefits of capturing longer-range
dependencies and allowing parallel training, which enables
the model to achieve better performance with less training
time. However, despite these advantages, the time and memory
complexities of computing the attention grow quadratically
with respect to input sequence length, requiring demanding
computing resources for training the model on long
sequences. For instance, if $L$ is the input sequence length and
$d$ is the dimension of the query, key, and value vectors,
Transformer’s attention is computed as follows:


$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V,$$


where $Q, K, V \in \mathbb{R}^{L \times d}$. The time and memory
complexities for computing $QK^\top$ in the above equation are
$O(L^2 d)$ and $O(L^2)$, respectively. Therefore, the cost for
training Transformer becomes prohibitive with large $L$, which
can prevent training the model even on a single GPU.
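
To make the quadratic cost concrete, the sketch below computes standard
softmax attention in plain Python and exposes the L x L weight matrix that
dominates memory. It illustrates the conventional attention described
above, not FAVOR+ or the authors' code, and the toy inputs are arbitrary.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the weights."""
    d = len(Q[0])
    # L x L score matrix: O(L^2 d) time and O(L^2) memory.
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
              for q in Q]
    weights = [softmax(row) for row in scores]
    d_v = len(V[0])
    output = [[sum(w * v[c] for w, v in zip(row, V)) for c in range(d_v)]
              for row in weights]
    return output, weights


L, d = 4, 2
Q = [[0.1 * (i + j) for j in range(d)] for i in range(L)]
K = [[0.2 * (i - j) for j in range(d)] for i in range(L)]
V = [[float(i), 1.0] for i in range(L)]
output, weights = attention(Q, K, V)
```

Doubling L quadruples the number of entries in `weights`, which is the
growth that motivates sub-quadratic attention variants.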


The problem of improving the efficiency of Transformer’s
attention mechanism is a common concern of the machine
learning community. Recent studies have proposed several
methods to reduce the computational complexity below
quadratic with respect to input sequence length [17,
28, 16, 25, 5]. In this paper, we adopt Performer [5] since
it uses reasonable memory and makes a better trade-off
between speed and performance [26]. Performer approximates
attention kernels through the Fast Attention Via positive
Orthogonal Random features (FAVOR+) approach. For more
details about FAVOR+, please refer to [5].


With the efficient attention mechanism by FAVOR+, we
propose the Performer encoder, which is a stack of several
identical reversible layers described in Figure 3. The
reversible layer is based on the Reversible Transformer [12, 17]
architecture to further improve memory efficiency in
backpropagation. An input of the reversible layer
$x \in \mathbb{R}^{L \times d_{hidden}}$ is first chunked to
$x_1, x_2 \in \mathbb{R}^{L \times d_{hidden}/2}$. Then, scaled
$l_2$ normalization (ScaleNorm) [20] and the FAVOR+-based
multi-head attention layer (MultiHeadAttn) are applied to $x_2$,
and the result is added to $x_1$ to compute
$y_1 \in \mathbb{R}^{L \times d_{hidden}/2}$:


$$y_1 = x_1 + \mathrm{MultiHeadAttn}(\mathrm{ScaleNorm}(x_2)).$$


After that, the scaled $l_2$ normalization and the point-wise
feed-forward layer (FeedForward) are applied to $y_1$, and the
result is added to $x_2$, computing
$y_2 \in \mathbb{R}^{L \times d_{hidden}/2}$:


$$y_2 = x_2 + \mathrm{FeedForward}(\mathrm{ScaleNorm}(y_1)).$$


An output of the reversible layer
$y \in \mathbb{R}^{L \times d_{hidden}}$ is a concatenation of
$y_1$ and $y_2$. We stack the reversible layer multiple times to
allow the final model to sufficiently represent the underlying
data distribution.
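
The memory saving of a reversible layer comes from the fact that its
inputs can be recomputed from its outputs, so intermediate activations
need not be stored for backpropagation. The sketch below shows this
property on toy vectors. The functions `f` and `g` are simple stand-ins
for MultiHeadAttn(ScaleNorm(.)) and FeedForward(ScaleNorm(.)), chosen only
so the arithmetic is easy to follow.

```python
def reversible_forward(x1, x2, f, g):
    """y1 = x1 + f(x2), then y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def reversible_inverse(y1, y2, f, g):
    """Recover the inputs from the outputs by undoing the two additions."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


def f(v):
    # Toy stand-in for the attention sub-layer.
    return [0.5 * a for a in v]


def g(v):
    # Toy stand-in for the feed-forward sub-layer.
    return [a * a for a in v]


x1, x2 = [1.0, 2.0], [3.0, 4.0]      # the two halves of a chunked input
y1, y2 = reversible_forward(x1, x2, f, g)
r1, r2 = reversible_inverse(y1, y2, f, g)
```

Neither `f` nor `g` has to be invertible for the reconstruction to work,
because each is only ever evaluated in the forward direction.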


4.3 Generator 

The generator computes hidden representations
$[h_1^G, \dots, h_T^G]$ by feeding the masked interaction
sequence $I^M$ to a series of the interaction embedding layer
(InterEmbedding), a point-wise feed-forward layer
(GenFeedForward1), the Performer encoder
(GenPerformerEncoder), and another point-wise feed-forward
layer (GenFeedForward2):


$$[I_1^{ME}, \dots, I_T^{ME}] = \mathrm{InterEmbedding}([I_1^M, \dots, I_T^M])$$
$$[h_1^{GF}, \dots, h_T^{GF}] = \mathrm{GenFeedForward1}([I_1^{ME}, \dots, I_T^{ME}])$$
$$[h_1^{GP}, \dots, h_T^{GP}] = \mathrm{GenPerformerEncoder}([h_1^{GF}, \dots, h_T^{GF}])$$
$$[h_1^G, \dots, h_T^G] = \mathrm{GenFeedForward2}([h_1^{GP}, \dots, h_T^{GP}]),$$


where $I_t^{ME}, h_t^G \in \mathbb{R}^{d_{emb}}$ and
$h_t^{GF}, h_t^{GP} \in \mathbb{R}^{d_{gen\_hidden}}$. Then,
depending on whether the masked features are categorical
or continuous variables, generator outputs are computed
differently. If the masked features are categorical variables,
the outputs are sampled from a probability distribution defined
by the following softmax layer:


$$O_{M_i}^{G, n_j} \sim P_G(f_{M_i}^{n_j} \mid I^M) = \mathrm{softmax}(E_{n_j} h_{M_i}^G).$$


If the masked features are continuous variables, the outputs
are computed by the following sigmoid layer:


$$O_{M_i}^{G, n_j} = \mathrm{sigmoid}(E_{n_j}^\top h_{M_i}^G).$$


Similar to the case of categorical masked features, one can
sample the outputs from a probability distribution defined
by $I^M$ and the parameters of the generator when the masked
features are continuous variables. For instance, the outputs can
be sampled from the Gaussian distribution where the mean
and the variance are determined by $I^M$ and the generator’s
parameters. However, we make the outputs deterministic
because sampling the outputs underperforms in our preliminary
experiments when the masked features are continuous
variables.
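
The two output modes can be contrasted in a short sketch: sampling from a
softmax for a categorical feature, and a deterministic sigmoid for a
continuous one. All numbers below are made up, and restricting the softmax
to the four answer options (leaving out the mask and cls rows) is an
assumption of this example, not something stated in the text.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_categorical(values, logits, rng):
    """Draw one value from the softmax distribution over the logits."""
    probs = softmax(logits)
    return rng.choices(values, weights=probs, k=1)[0]


# Hypothetical generator hidden vector at one masked position.
h = [0.2, -0.1, 0.4]

# Hypothetical embedding rows for 'a', 'b', 'c', 'd' and a latent vector
# for elapsed_time.
OPTIONS = ["a", "b", "c", "d"]
E_response = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
E_elapsed_time = [0.3, -0.2, 0.1]

rng = random.Random(0)
logits = [dot(row, h) for row in E_response]
response_prediction = sample_categorical(OPTIONS, logits, rng)   # stochastic
elapsed_prediction = sigmoid(dot(E_elapsed_time, h))             # deterministic
```

The sigmoid keeps the continuous prediction inside (0, 1), which matches
the normalized range of the continuous features.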
4.4 Discriminator

In pre-training, the outputs of the discriminator
$O^D = [O_1^D, \dots, O_T^D]$ are computed by applying a series
of the interaction embedding layer (InterEmbedding), a
point-wise feed-forward layer (DisFeedForward1), the Performer
encoder (DisPerformerEncoder), and another point-wise
feed-forward layer (DisFeedForward2) to the replaced
interaction sequence $I^R$:


$$[I_1^{RE}, \dots, I_T^{RE}] = \mathrm{InterEmbedding}([I_1^R, \dots, I_T^R])$$
$$[h_1^{DF}, \dots, h_T^{DF}] = \mathrm{DisFeedForward1}([I_1^{RE}, \dots, I_T^{RE}])$$
$$[h_1^{DP}, \dots, h_T^{DP}] = \mathrm{DisPerformerEncoder}([h_1^{DF}, \dots, h_T^{DF}])$$
$$[O_1^D, \dots, O_T^D] = \mathrm{DisFeedForward2}([h_1^{DP}, \dots, h_T^{DP}]),$$


where $I_t^{RE} \in \mathbb{R}^{d_{emb}}$,
$h_t^{DF}, h_t^{DP} \in \mathbb{R}^{d_{dis\_hidden}}$,
$O_t^D \in \mathbb{R}$, and the sigmoid is applied to the last
layer of the discriminator. After the pre-training, we slightly
modify the discriminator by replacing the last layer with a
layer having the appropriate dimension for academic test
performance prediction.


4.5 Training Objectives 
The objective for pre-training is to minimize the following
loss function:


$$\sum_{i=1}^{m} \sum_{j=1}^{n} \mathrm{GenLoss}\left(O_{M_i}^{G, n_j}, f_{M_i}^{n_j}\right) + \lambda \sum_{t=1}^{T} \mathrm{DisLoss}\left(O_t^D, \mathbb{1}(I_t^R = I_t)\right),$$


where GenLoss is the cross entropy (or mean squared error)
loss function if the masked features are categorical (or
continuous) variables, DisLoss is the binary cross entropy loss
function, and $\mathbb{1}$ is the indicator function. For ease
of notation, we omit an index for each input sample in the above
equation. If there is more than one masked feature in
each time-step ($n > 1$), the generator is trained under the
multi-task learning scheme. The objective for fine-tuning is
to minimize the mean squared error loss between the model’s
predictions and score labels.
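
For a single categorical masked feature, the pre-training loss can be
sketched as below. This is an illustration under assumptions: the
probabilities are invented, the weight `lam` is a hypothetical value, and
the discriminator output is read as the probability that a token is
original, following the indicator target in the loss above.

```python
import math


def cross_entropy(probs, target_index):
    """Generator loss for one masked categorical feature."""
    return -math.log(probs[target_index])


def binary_cross_entropy(p, label):
    """Discriminator loss for one token; label 1 means original."""
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def pretraining_loss(gen_probs, gen_targets, dis_probs, original, replaced, lam):
    """Sum of generator losses plus lam times the sum of discriminator losses."""
    gen_loss = sum(cross_entropy(p, y) for p, y in zip(gen_probs, gen_targets))
    labels = [int(o == r) for o, r in zip(original, replaced)]
    dis_loss = sum(binary_cross_entropy(p, y) for p, y in zip(dis_probs, labels))
    return gen_loss + lam * dis_loss


# Responses from the Figure 2 example (positions 2 and 3 were masked).
original = ["b", "c", "a", "a"]
replaced = ["b", "b", "a", "a"]

# Hypothetical generator distributions over ('a', 'b', 'c', 'd') at the two
# masked positions, with the true responses 'c' (index 2) and 'a' (index 0).
gen_probs = [[0.1, 0.6, 0.2, 0.1], [0.7, 0.1, 0.1, 0.1]]
gen_targets = [2, 0]

# Hypothetical discriminator outputs: probability that each token is original.
dis_probs = [0.9, 0.2, 0.8, 0.9]

loss = pretraining_loss(gen_probs, gen_targets, dis_probs,
                        original, replaced, lam=1.0)
```

The discriminator term sums over all T tokens, while the generator term
sums only over the masked positions.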


5. EXPERIMENTS 


5.1 Effects of Generator’s Pre-training Tasks 
There are multiple interactive features to be masked in each 
token of the interaction sequence, which raises a question 
of how to construct a set of masked interactive features, 
and accordingly, which pre-training task for the generator is 
the most effective for academic test performance prediction. 
By default, all interactive features listed in Section 3 are 
taken as inputs for both the generator and discriminator. 
However, if response (or elapsed_time) is masked, correctness 
(or timeliness) is excluded from the inputs and vice versa 
since there is an overlap of information that the features 
represent. For example, when both response and correctness 
are taken as inputs, and correctness is masked, the generator 
can predict the masked correctness by only looking at eid 
and response without considering other interactions, which 
leads to poor pre-training. The results are described in Table 1.


Table 1: Comparison between different pre-training tasks.

Pre-training task                         | MAE
response                                  | 50.65 ± 1.26
response + elapsed_time                   | 54.86 ± 1.64
response + timeliness                     | 52.91 ± 1.38
response + exp_time                       | 57.54 ± 1.47
response + inactive_time                  | 60.69 ± 1.74
correctness                               | 51.36 ± 0.97
correctness + elapsed_time                | 53.36 ± 1.43
correctness + timeliness                  | 52.60 ± 1.20
correctness + exp_time                    | 54.36 ± 1.62
correctness + inactive_time               | 55.04 ± 1.58
response + correctness                    | 51.13 ± 1.60
response + correctness + elapsed_time     | 52.15 ± 1.43
response + correctness + timeliness       | 53.05 ± 1.81
response + correctness + exp_time         | 53.09 ± 1.25
response + correctness + inactive_time    | 56.41 ± 1.72


The best result was obtained under the pre-training task of
predicting response alone, which is slightly better than
predicting correctness alone or both response and correctness.
Predicting the correctness of a student response is an important
task in AIEd, as can be seen from the large volume of studies
on Knowledge Tracing. Also, [3] empirically showed
that student response correctness is the most pedagogical
interactive feature for academic test performance prediction.
However, rather than pre-training a model to predict
whether a student correctly responded to a given exercise,
the pre-training task of predicting the student response itself
injects more fine-grained information into the model, which
leads to more effective pre-training for academic test
performance prediction. Interestingly, worse results were
obtained when predicting elapsed_time or timeliness in
pre-training, despite the benefits this information brings to
several AIEd tasks [10, 30, 22]. We hypothesize that
elapsed_time and timeliness may introduce irrelevant noise
and thus guide the model in a direction inappropriate for
academic test performance prediction. In the case of
exp_time and inactive_time, we observed that the generator
failed to learn to predict their values when only given the
interactive features listed in Section 3, which leads to
unstable pre-training. Based on these observations, in the
following subsections, we conduct experimental studies based on
the pre-training task of predicting response alone.


5.2 DPA vs. Baseline Methods 
We compare DPA with the following pre-training methods: 


• No pre-training: We train the fine-tuning models only
on the fine-tuning dataset.


• Autoencoding: Autoencoding (AE) is a generative
pre-training method widely used across different domains
of machine learning including AIEd [13, 8]. Given an
unmasked interaction sequence, AE pre-trains a model
to reconstruct the input interaction sequence.


• Assessment Modeling: Assessment Modeling (AM) [3]
is the previous state-of-the-art generative pre-training
method for academic test performance prediction. In
AM, a model takes a masked interaction sequence as
an input and is pre-trained to predict masked features.
AM is exactly the same as fine-tuning the pre-trained
generator in DPA.


Also, we investigate whether DPA is effective with the
following different fine-tuning models:


Table 2: Comparison of DPA with baseline methods.

Pre-training method | Fine-tuning model   | MAE
No pre-training     | MLP                 | 82.89 ± 3.23
No pre-training     | BiLSTM              | 84.05 ± 2.06
No pre-training     | Transformer encoder | 107.06 ± 2.52
No pre-training     | Performer encoder   | 81.76 ± 1.24
AE                  | MLP                 | 79.46 ± 1.15
AE                  | BiLSTM              | 85.64 ± 1.89
AE                  | Transformer encoder | 75.13 ± 3.10
AE                  | Performer encoder   | 64.80 ± 1.43
AM                  | MLP                 | 77.17 ± 2.14
AM                  | BiLSTM              | 58.16 ± 1.28
AM                  | Transformer encoder | 57.16 ± 2.08
AM                  | Performer encoder   | 52.79 ± 1.39
DPA                 | MLP                 | 77.24 ± 1.59
DPA                 | BiLSTM              | 57.59 ± 1.76
DPA                 | Transformer encoder | 55.99 ± 1.62
DPA                 | Performer encoder   | 50.65 ± 1.26

• MLP: Multi-Layer Perceptron (MLP) is a stack of simple
fully-connected layers. Given an interaction sequence,
interaction embedding vectors of all time-steps
are summed together to compute a fixed-dimensional
vector which is fed to a series of the fully-connected
layers.


• BiLSTM: Bi-directional Long Short-Term Memory (BiLSTM)
is a model widely used for time series data prediction
tasks. The global max pooling layer is applied
on top of the BiLSTM layer to obtain a fixed-dimensional
intermediate representation from an input
sequence of varying length.


• Transformer Encoder: Transformer Encoder is a series
of several identical layers composed of a multi-head
self-attention layer with the softmax attention kernel
and a point-wise feed-forward layer. We set the
Transformer encoder’s attention window size to 512 because
training the Transformer encoder with an attention window
size of 1024 runs out of GPU memory on our single-GPU
machine.


As described in Table 2, transferring the pre-trained
knowledge brings better results in most cases, and the best
result is obtained from DPA. In particular, when the Performer
encoder, the best performing fine-tuning model, is used,
DPA reduces MAE by 4.05%, 21.84%, and 38.05% compared to AM,
AE, and No pre-training, respectively. Among the baseline
pre-training methods excluding No pre-training, the worst
result is obtained from AE because the pre-training task of AE
is much easier than that of AM and DPA. We observed that the
loss curve of AE converged to near zero within the first
pre-training evaluation.
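
The reported percentages follow directly from the Performer encoder rows
of Table 2. The short check below recomputes them as relative reductions
in MAE. The function name is hypothetical, and only the mean values from
the table are used.

```python
def relative_reduction(baseline_mae, dpa_mae):
    """Percentage by which DPA lowers MAE relative to a baseline."""
    return 100.0 * (baseline_mae - dpa_mae) / baseline_mae


# Mean MAE of the Performer encoder under each pre-training method (Table 2).
performer_mae = {
    "No pre-training": 81.76,
    "AE": 64.80,
    "AM": 52.79,
    "DPA": 50.65,
}

reductions = {
    name: round(relative_reduction(mae, performer_mae["DPA"]), 2)
    for name, mae in performer_mae.items()
    if name != "DPA"
}
```

For example, the reduction relative to AM is
100 × (52.79 − 50.65) / 52.79, which rounds to 4.05.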


5.3 Robustness to Increased Label-scarcity

Since the motivation behind our proposal of DPA is the
label-scarcity problem, we investigate how MAE changes
at varying degrees of label-scarcity. Figure 4 and Table 3
describe the results when using 1/2, 1/4, and 1/8 of the
total number of fine-tuning training samples. At all degrees
of label-scarcity, DPA consistently outperforms AM.
Also, DPA fine-tuned on 1/2, 1/4, and 1/8 of the dataset
outperforms AM fine-tuned on the entire dataset, 1/2, and
1/4 of the dataset, respectively, which shows that DPA is
more robust to label-scarcity than AM. The gap between No
pre-training and the other two pre-training methods increases
as labels become scarcer. Furthermore, the other two
pre-training methods fine-tuned on 1/8 of the dataset
outperform No pre-training fine-tuned on the entire dataset.


Figure 4: The black, blue, and red lines represent MAEs
for No pre-training, AM, and DPA, respectively, when the
number of fine-tuning training samples becomes 1/2, 1/4,
and 1/8 of the entire dataset.


Table 3: Comparison of DPA with AM and No pre-training
at varying degrees of label-scarcity.

Fraction of fine-tuning samples | No pre-training | AM           | DPA
1/8                             | 94.21 ± 8.40    | 60.22 ± 1.86 | 55.90 ± 1.97
1/4                             | 89.01 ± 2.14    | 57.08 ± 1.75 | 53.46 ± 1.45
1/2                             | 85.37 ± 1.15    | 54.29 ± 1.50 | 51.38 ± 1.16
Full                            | 81.76 ± 1.24    | 52.79 ± 1.39 | 50.65 ± 1.26


6. CONCLUSION 


In this paper, we proposed DPA, a transfer learning frame- 
work with discriminative pre-training tasks for academic 
performance prediction. Our experimental results showed 
the effectiveness of DPA for the label-scarce academic per- 
formance prediction task over the previous state-of-the-art 
generative pre-training method. Avenues of future research 
include investigating more effective pre-training tasks for 
academic performance prediction and pre-train/fine-tune re- 
lations in AIEd. 